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Standard Gibbs sampling applied to a multivariate normal distribution with a specified preci¬ 
sion matrix is equivalent in fundamental ways to the Gauss-Seidel iterative solution of linear 
equations in the precision matrix. Specifically, the iteration operators, the conditions under 
which convergence occurs, and geometric convergence factors (and rates) are identical. These 
results hold for arbitrary matrix splittings from classical iterative methods in numerical linear 
algebra giving easy access to mature results in that field, including existing convergence results 
for antithetic-variable Gibbs sampling, REGS sampling, and generalizations. Hence, efficient 
deterministic stationary relaxation schemes lead to efficient generalizations of Gibbs sampling. 
The technique of polynomial acceleration that significantly improves the convergence rate of 
an iterative solver derived from a symmetric matrix splitting may be applied to accelerate the 
equivalent generalized Gibbs sampler. Identicality of error polynomials guarantees convergence 
of the inhomogeneous Markov chain, while equality of convergence factors ensures that the 
optimal solver leads to the optimal sampler. Numerical examples are presented, including a 
Chebyshev accelerated SSOR Gibbs sampler applied to a stylized demonstration of low-level 
Bayesian image reconstruction in a large 3-dimensional linear inverse problem. 

Keywords: Bayesian inference, Gaussian Markov random field, Gibbs sampling, matrix splitting, 
multivariate normal distribution, non-stationary stochastic iteration, polynomial acceleration 


1. Introduction 

The Metropolis-Bastings algorithm for MCMC was introduced to main-stream statis¬ 
tics around 1990 (Robert and Casella, 2011), though prior to that the Gibbs sampler 
provided a coherent approach to investigating distributions with Markov random held 
structure (Turcin, 1971; Grenander, 1983; Geman and Geman, 1984; Gelfand and Smith, 
1990; Besag and Green, 1993; Sokal, 1993). The Gibbs sampler may be thought of as 
a particular Metropolis-Bastings algorithm that uses the conditional distributions as 
proposal distributions, with acceptance probability always equal to 1 (Geyer, 2011). 
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In statistics the Gibbs sampler is popular because of ease of implementation (see, e.g., 
Roberts and Sahu, 1997), when conditional distributions are available in the sense that 
samples may be drawn from the full conditionals. However, the Gibbs sampler is not 
often presented as an efficient algorithm, particularly for massive models. In this work 
we show that generalized and accelerated Gibbs samplers are contenders for the fastest 
sampling algorithms for normal target distributions, because they are equivalent to the 
fastest algorithms for solving systems of linear equations. 

Almost all current MGMG algorithms, including Gibbs samplers, simulate a fixed 
transition kernel that induces a homogeneous Markov chain that converges geometrically 
in distribution to the desired target distribution. In this aspect, modern variants of 
the Metropolis-Hastings algorithm are unchanged from the Metropolis algorithm as first 
implemented in the 1950’s. The adaptive Metropolis algorithm of Haario et al. (2001) (see 
also Roberts and Rosenthal, 2007) is an exception, though it converges to a geometrically 
convergent Metropolis-Hastings algorithm that bounds convergence behaviour. 

We focus on the application of Gibbs sampling to drawing samples from a multivari¬ 
ate normal distribution with a given covariance or precision matrix. Our concern is to 
develop generalized Gibbs samplers with optimal geometric, or better than geometric, 
distributional convergence by drawing on ideas in numerical computation, particularly 
the mature field of computational linear algebra. We apply the matrix-splitting formalism 
to show that fixed-scan Gibbs sampling from a multivariate normal is equivalent in fun¬ 
damental ways to the stationary linear iterative solvers applied to systems of equations 
in the precision matrix. 

Stationary iterative solvers are now considered to be very slow precisely because of 
their geometric rate of convergence, and are no longer used for large systems. However, 
they remain a basic building block in the most efficient linear solvers. By establishing 
equivalence of error polynomials we provide a route whereby acceleration techniques from 
numerical linear algebra may be applied to Gibbs sampling from normal distributions. 
The fastest solvers employ non-stationary iterations, hence the equivalent generalized 
Gibbs sampler induces an inhomogeneous Markov chain. Explicit calculation of the error 
polynomial guarantees convergence, while control of the error polynomial gives optimal 
performance. 

The adoption of the matrix splitting formalism gives the following practical benefits 
in the context of fixed-scan Gibbs sampling from normal targets: 

1. a one-to-one equivalence between generalized Gibbs samplers and classical linear 
iterative solvers; 

2. rates of convergence and error polynomials for the Markov chain induced by a 
generalized Gibbs sampler; 

3. acceleration of the Gibbs sampler to induce an inhomogeneous Markov chain that 
achieves the optimal error polynomial, and hence has optimal convergence rate; 

4. numerical estimates of convergence rate of the Gibbs sampler in a single chain and 
a priori estimates of number of iterations to convergence; 

5. access to preconditioning, whereby the sampling problem is transformed into an 
equivalent problem for which the accelerated Gibbs sampler has improved conver- 
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gence rate. 

Some direct linear solvers have already been adapted to sampling from multivari¬ 
ate normal distributions, with Rue (2001) demonstrating the use of solvers based on 
Cholesky factorization to allow computationally efficient sampling. This paper extends 
the connection to the iterative linear solvers. Since iterative methods are the most ef¬ 
ficient for massive linear systems, the associated samplers will be the most efficient for 
very high dimensional normal targets. 

1.1. Context and overview of resnlts 

The Cholesky factorization is the conventional way to produce samples from a moderately 
sized multivariate normal distribution (Rue, 2001; Rue and Held, 2005), and is also the 
preferred method for solving moderately sized linear systems. For large linear systems, 
iterative solvers are the methods of choice due to their inexpensive cost per iteration, 
and small computer memory requirements. 

Gibbs samplers applied to normal distributions are essentially identical to station¬ 
ary iterative methods from numerical linear algebra. This connection was exploited by 
Adler (1981), and independently by Barone and Frigessi (1990), who noted that the 
component-wise Gibbs sampler is a stochastic version of the Gauss-Seidel linear solver, 
and accelerated the Gibbs sampler by introducing a relaxation parameter to implement 
the stochastic version of the successive over-relaxation (SOR) of Gauss-Seidel. This pair¬ 
ing was further analyzed by Goodman and Sokal (1989). 

This equivalence is depicted in panels A and B of Figure 1. Panel B shows the contours 
of a normal density 7r(x), and a sequence of coordinate-wise conditional samples taken 
by the Gibbs sampler applied to tt. Panel A shows the contours of the quadratic minus 
log(7r(x)) and the Gauss-Seidel sequence of coordinate optimizations^, or, equivalently, 
solves of the normal equations Vlog7r(x) = 0. Note how in Gauss-Seidel the step sizes 
decrease towards convergence, which is a tell-tale sign that convergence (in value) is 
geometric. In Section 4 we will show that the iteration operator is identical to that of the 
Gibbs sampler in panel B, and hence the Gibbs sampler also converges geometrically (in 
distribution). Slow convergence of these algorithms is usually understood in terms of the 
same intuition; high correlations correspond to long narrow contours, and lead to small 
steps in coordinate directions and many iterations being required to move appreciably 
along the long axis of the target function. 

Roberts and Sahu (1997) considered forward then backward sweeps of coordinate-wise 
Gibbs sampling, with relaxation parameter, to give a sampler they termed the REGS 
sampler. This is a stochastic version of the symmetric-SOR (SSOR) iteration, which 
comprises forward then backward sweeps of SOR. 

The equality of iteration operators and error polynomials, for these pairs of fixed- 
scan Gibbs samplers and iterative solvers, allows existing convergence results in nu¬ 
merical analysis texts (for example Axelsson, 1996; Golub and Loan, 1989; Nevanlinna, 

^Gauss-Seidel optimization was rediscovered by Besag (1986) as iterated conditional modes. 
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Solving Ax = b 

A: Gauss-Seidel 



C: Chebyshev-SSOR 



E: CG 



Sampling from N(/x, A 
B: Gibbs 




F: GG Gibbs 



Figure 1. The panels in the left column show the contours of a quadratic function ^x^Ax — b^x in 
two dimensions and the iteration paths for some common optimizers towards the minimizer /x = A“^b, 
or equivalently the path of iterative linear solvers of Ax = b. The right column presents the iteration 
paths of the samplers corresponding to each linear solver, along with the contours of the normal density 
kexp { — ^x^Ax + b^x}, where k is the normalizing constant. In all panels, the matrix A has eigenpairs 
{(lO, [1 1]^) , (l, [1 — 1]^)}- The Gauss-Seidel solver took 45 iterations to converge to /x (shown are 
the 90 coordinate steps; each iteration is a “sweep” of the two coordinate directions), the Chebyshev 
polynomial accelerated SSOR required just 16 iterations to converge, while CG finds the minimizer in 2 
iterations. For each of the samplers 10 iterations are shown (the 20 coordinate steps are shown for the 
Gibbs sampler). The correspondence between these linear solvers/optimizers and samplers is treated in 
the text (CG in supplementary material). 
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1993; Saad, 2003; Young, 1971) to be used to establish convergence results for the cor¬ 
responding Gibbs sampler. Existing results for rates of distributional convergence by 
hxed-sweep Gibbs samplers (Adler, 1981; Barone and Frigessi, 1990; Liu et ah, 1995; 
Roberts and Sahu, 1997) may be established this way. 

The methods of Gauss-Seidel, SOR, and SSOR, give stationary linear iterations that 
were used as linear solvers in the 1950’s, and are now considered very slow. The corre¬ 
sponding fixed-scan Gibbs samplers are slow for precisely the same reason. The last fifty 
years has seen an explosion of theoretical results and algorithmic development that have 
made linear solvers faster and more efficient, so that for large problems, stationary meth¬ 
ods are used as preconditioners at best, while the method of preconditioned conjugate 
gradients, GMRES, multigrid, or fast-multipole methods are the current state-of-the-art 
for solving linear systems in a finite number of steps (Saad and van der Vorst, 2000). 

Linear iterations derived from a symmetric splitting may be sped up by polynomial 
acceleration, particularly Ghebyshev acceleration that results in optimal error reduction 
amongst methods that have a fixed non-stationary iteration structure (Fox and Parker, 
1968; Axelsson, 1996). The Ghebyshev accelerated SSOR solver and corresponding Gheby¬ 
shev accelerated SSOR sampler (Fox and Parker, 2014) are depicted in panels C and D 
of Figure 1. Both the solver and sampler take steps that are more aligned with the long 
axis of the target, compared to the coordinate-wise algorithms, and hence achieve faster 
convergence. However, the step size of Ghebyshev-SSOR solving still decreases towards 
convergence, and hence convergence for both solver and sampler is still asymptotically 
geometric, albeit with much improved rate. 

Fox and Parker (2014) considered point-wise convergence of the mean and variance of 
a Gibbs SSOR sampler accelerated by Ghebyshev polynomials. In this paper we prove 
convergence in distribution for Gibbs samplers corresponding to any matrix splitting and 
accelerated by any polynomial that is independent of the Gibbs iterations. We then apply 
a polynomial accelerated sampler to solve a massive Bayesian linear inverse problem that 
is infeasible to solve using conventional techniques. 

Ghebyshev acceleration requires estimates of the extreme eigenvalues of the error op¬ 
erator, which we obtain via a conjugate-gradient (GG) algorithm at no significant com¬ 
putational cost (Meurant, 2006). The GG algorithm itself may be adapted to sample 
from normal distributions; the GG solver and corresponding sampler, depicted in panels 
E and F of Figure 1, were analysed by Parker and Fox (2012) and is discussed in the 
supplementary material. 


1.2. Structure of the paper 

In Section 2 we review efficient methods for sampling from normal distributions, high¬ 
lighting Gibbs sampling in various algorithmic forms. Standard results for stationary 
iterative solvers are presented in Section 3. Theorems in Section 4 establish equivalence 
of convergence and convergence factors for iterative solvers and Gibbs samplers. Appli¬ 
cation of polynomial acceleration methods to linear solvers and Gibbs sampling is given 
in Section 5, including a proof of convergence of the first and second moments of a poly- 
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nomial accelerated sampler. Numerical verification of convergence results is presented in 
Section 6. 


2. Sampling from multivariate normal distributions 

We consider the problem of sampling from an n-dimensional normal distribution N(/x, S) 
defined by the mean n-vector /x, and the n x n symmetric and positive definite (SPD) 
covariance matrix S. Since if z ~ N(0, S) then z + /x ~ N(/x,S), it often suffices to 
consider drawing samples from normal distributions with zero mean. An exception is 
when /X is defined implicitly, which we discuss in section 4.1. 

In Bayesian formulations of inverse problems that use a GMRF as a prior distri¬ 
bution, typically the precision matrix A = is explicitly modeled and available 

(Rue and Held, 2005; Higdon, 2006), perhaps as part of a hierarchical model (Banerjee et ah, 
2003). Typically then the precision matrix (conditioned on hyperparameters) is large 
though sparse, if the neighborhoods specifying conditional independence are small. We 
are particularly interested in this case, and throughout the paper will focus on sam¬ 
pling from N(0, A~^) when A is sparse and large, or when some other property makes 
operating by A easy, i.e., one can evaluate Ax for any vector x. 

Standard sampling methods for moderately sized normal distributions utilize the 
Cholesky factorization (Rue, 2001; Rue and Held, 2005) since it is fast, incurring approxi¬ 
mately (l/3)n^ floating point operations (flops) and is backwards stable (Watkins, 2002). 
Samples can also be drawn using the more expensive eigen-decomposition (Rue and Held, 
2005), that costs approximately (10/3)n^ flops, or more generally using mutually conju¬ 
gate vectors (Fox, 2008; Parker and Fox, 2012). For stationary Gaussian random fields 
defined on the lattice, Fourier methods can lead to efficient sampling for large prob¬ 
lems (Gneiting et ah, 2005). 

Algorithm 1 shows the steps for sampling from N(0, S) using Cholesky factorization, 
when the covariance matrix S is available (Neal, 1997; MacKay, 2003; Higdon, 2006). 

Algorithm 1: Cholesky sampling using a covariance matrix S 

input : Covariance matrix S 

output: y ~ N(0, S) 

Cholesky factor S = CC^; 

sample z ^ N(0,I); 

y = Cz; 

When the precision matrix A is available, a sample y ~ N(0, A~^) may be drawn 
using Algorithm 2 given by Rue (2001) (see also Rue and Held, 2005). 
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Algorithm 2: Cholesky sampling using a precision matrix A 
input : Precision matrix A 
output: y ~ N(0, A^^) 

Cholesky factor A = BB^; 

sample z ~ N(0,I); 

solve B’^y = z by back substitution; 

The computational cost of Algorithm 2 depends on the bandwidth of A, that also 
bounds the bandwidth of the Cholesky factor B. For a bandwidth b, calculation of 
the Cholesky factorization requires OilPn) flops, which provides savings over the full- 
bandwidth case when b n/2 (Golub and Loan, 1989; Rue, 2001; Watkins, 2002). For 
GMRF’s defined over 2-dimensional domains, the use of a bandwidth reducing permu¬ 
tation often leads to substantial computational savings (Rue, 2001; Watkins, 2002). In 
3-dimensions and above, typically no permutation exists that can significantly reduce 
the bandwidth below hence the cost of sampling by Cholesky factoring is at least 
flops. Further, Cholesky factorizing requires that the precision matrix and the 
Cholesky factor be stored in computer memory, which can be prohibitive for large prob¬ 
lems. In Section 6 we give an example of sampling from a large GMRF for which Cholesky 
factorization is prohibitively expensive. 

2.1. Gibbs sampling from a normal distribution 

Iterative samplers, such as Gibbs, are an attractive option when drawing samples from 
high dimensional multivariate normal distributions due to their inexpensive cost per iter¬ 
ation and small computer memory requirements (only vectors of size n need be stored). 
If the precision matrix is sparse with 0{n) non-zero elements, then, regardless of the 
bandwidth, iterative methods cost only about 2n flops per iteration, which is compara¬ 
ble with sparse Cholesky factorizations. However, when the bandwidth is 0{n), the cost 
of the Cholesky factorization is high at 0{n^) flops, while iterative methods maintain 
their inexpensive cost per iteration. Iterative methods are then preferable when requiring 
significantly fewer than 0(n^) iterations for adequate convergence. In the examples pre¬ 
sented in section 6 we find that 0{n) iterations give convergence to machine precision, 
so the iterative methods are preferable for large problems. 

2.1.1. Componentwise formulation 

One of the simplest iterative sampling methods is the component-sweep Gibbs sampler 
(Geman and Geman, 1984; Gelman et ah, 1995; Gilks et ah, 1996; Rue and Held, 2005). 
Let y = (yi,?/ 2 , & 5?” denote a vector in terms of its components, and let A be 

an n X n precision matrix with elements {aij}. One sweep over all n components can be 
written as in Algorithm 3 (Barone and Frigessi, 1990), showing that the algorithm can 
be implemented using vector and scalar operations only, and storage or inversion of the 
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precision matrix A is not required. 


Algorithm 3: Component-sweep Gibbs sampling using a precision matrix A 

input : Precision matrix A, initial state , and 

maximum iteration fcmax 

output: , y^^^, • ■ ■, } where y^^'^ ^ N(0, A“^) as fc —>• oo 

for k = l,2,...,A;max do 
for z = 1, 2,..., n do 


Sample z ^ N(0,1); 


(k) z 
Vi = 






-E 


dijUj 


(k) 


y3>i 


j<i 


end 

end 


The index k may be omitted (and with ‘=’ interpreted as assignment) to give an 
algorithm that can be evaluated in place, requiring minimal storage. 

2.1.2. Matrix formulation 

One iteration in Algorithm 3 consists of a sweep over all n components of y*^^^ in sequence. 
The iteration can be written succinctly in the matrix form (Goodman and Sokal, 1989) 

y{k+i) ^ - D^^L^y('=) -f (1) 

where ~ N(0,1), D = diag(A), and L is the strictly lower triangular part of A. This 
equation makes clear that the computational cost of each sweep is about 2n^ flops, when 
A is dense, due to multiplication by the triangular matrices L and L^, and 0{n) flops 
when A is sparse. 

Extending this formulation to sweeps over any other fixed sequence of coordinates 
is achieved by putting PAP"”" in place of A for some permutation matrix P. The use 
of random sweep Gibbs sampling has also been suggested (Amit and Grenander, 1991; 
Fishman, 1996; Liu et ah, 1995; Roberts and Sahu, 1997), though we do not consider 
that here. 


2.1.3. Convergence 

If the iterates y^^^ in (1) converge in distribution to a distribution 11 which is independent 
of the starting state y^®\ then the sampler is convergent, and we write 

y('=) 4 n. 

It is well known that the iterates y^^^ in the Gibbs sampler (1) converge in distribution 
geometrically to N(0, A~^) = N(0, S) (Roberts and Sahu, 1997). We consider geometric 
convergence in detail in Section 4. 
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3. Linear stationary iterative methods as linear 
equation solvers 

Our work draws heavily on existing results for stationary linear iterative methods for 
solving linear systems. Here we briefly review the main results that we will use. 
Consider a system of linear equations written as the matrix equation 


Ax = b (2) 

where A is a given n x n nonsingular matrix and b is a given n-dimensional vector. 
The problem is to find an n-dimensional vector x satisfying equation (2). Later we will 
consider the case where A is symmetric positive definite (SPD) as holds for covariance 
and precision matrices (Feller, 1968). 

3.1. Matrix splitting form of stationary iterative algorithms 

A common class of methods for solving (2) are the linear iterative methods based on a 
splitting of A into A = M — N. The matrix splitting is the standard way of expressing 
and analyzing linear iterative algorithms, following its introduction by Varga (1962). 
The system (2) is then transformed to Mx = Nx -|- b or, if M is nonsingular, x = 
M~^Nx+M~^b. The iterative methods use this equation to compute successively better 
approximations x*^^) to the solution using the iteration step 

x(fe+i) = M-^Nx('=) 4- M-^b = Gx('=) -k g. (3) 

We follow the standard terminology used for these methods (see e.g. Axelsson, 1996; 
Golub and Loan, 1989; Saad, 2003; Young, 1971). Such methods are termed linear sta¬ 
tionary iterative methods (of the first degree); they are stationary^ because the iteration 
matrix G = M“^N and the vector g = M~^b do not depend on k. The splitting is 
symmetric when both M and N are symmetric matrices. The iteration, and splitting, 
is convergent if tends to a limit independent of x*^°\ the limit being A^^b (see, 
e.g. (Young, 1971, Theorem 5.2)). 

The iteration (3) is often written in the residual form so that convergence may be 
monitored in terms of the norm of the residual vector, and emphasizes that is 

acting as an approximation to A^^, as in Algorithm 4. 

^This use of stationary corresponds to the term homogeneous when referring to a Markov chain. 
It is not to be confused with a stationary distribution that is invariant under the iteration. Later we 
will develop non-stationary iterations, inducing a non-homogeneous Markov chain that will, however, 
preserve the target distribution at each iterate. 
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Algorithm 4: Stationary iterative solve of Ax = b 

input : The splitting M, N of A, initial state 
output: x^^^ approximating x* = A”^b 

fc = 0; 

repeat 

rW =b-Ax^'^); 
x(fe+i) = 
increment fc; 

until is sufficiently small; 

In computational algorithms it is important to note that the symbol M~^r is inter¬ 
preted as “solve the system Mu = r for u” rather than “form the inverse of M and 
multiply r by M“^” since the latter is much more computationally expensive (about 2n^ 
flops (Watkins, 2002)). Thus, the practicality of a splitting depends on the ease with 
which one can solve Mu = r for any vector r. 

3.1.1. The Gauss-Seidel algorithm 

Many splittings of the matrix A use the terms in the expansion A = L -|- D -|- U where 
L is the strictly lower triangular part of A, D is the diagonal of A, and U is the strictly 
upper triangular part. 

For example, choosing M = L -|- D (so N = —U) allows Mu = r to be solved by 
“forward substitution” (at a cost of flops when A is dense), and hence does not 
require inversion or Gauss-elimination of M (which would cost 2/3n^ flops when A is 
dense). Using this splitting in Algorithm 4 results in the Gauss-Seidel iterative algorithm. 
When A is symmetric, U = L"*", and the Gauss-Seidel iteration can be written as 

x(fc+i) = + D^^b. (4) 

Just as we pointed out for the Gibbs sampler, variants of the Gauss-Seidel algorithm 
such as “red-black” coordinate updates (Saad, 2003), may be written in this form using 
a suitable permutation matrix. 

The component-wise form of the Gauss-Seidel algorithm can be written in ‘equation’ 
form just as the Gibbs sampler (1) was in Algorithm 3. The component-wise form em¬ 
phasizes that Gauss-Seidel can be implemented using vector and scalar operations only, 
and neither storage nor inversion of the splitting is required. 

3.2. Convergence 

A fundamental theorem of linear stationary iterative methods states that the splitting 
A = M — N, where M is nonsingular, is convergent (i.e., x^^^ —>■ A~^b for any x*^°)) if 
and only if g (M^^N) < 1, where g (•) denotes the spectral radius of a matrix (Young, 
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Table 1. Common stationary linear solvers generated by splittings A = M — N, and conditions that 

guarantee convergence when A is SPD 


splitting 

M 

convergence 

Richardson (R) 

il 

CJ 

e{A) 

A strictly diagonally dominant 

Jacobi (J) 

D 

Gauss-Seidel (GS) 

D + L 

always 

SOR 

I-D-l-L 

0 < tj < 2 

SSOR 


0 < tJ < 2 


1971, Theorem 3.5.1). This characterization is often used as a definition (Axelsson, 1996; 
Golub and Loan, 1989; Saad, 2003). 

The error at step k is — x*. where x* = A^^b. It follows that 

e(fc+i) = (M-iN)'=e(°) (5) 

and hence the asymptotic average reduction in error per iteration is the multiplicative 
factor 

(Axelsson, 1996, p. 166). In numerical analysis this is called the (asymptotic average) 
convergence factor (Axelsson, 1996; Saad, 2003). Later, we will show that this is exactly 
the same as the quantity called the geometric convergence rate in the statistics literature 
(see e.g. Robert and Casella, 1999), for the equivalent Gibbs sampler. We will use the 
term ‘convergence factor’ throughout this paper to avoid a clash of terminology, since in 
numerical analysis the rate of convergence is minus the log of the convergence factor (see 
e.g. Axelsson, 1996, p. 166). 

3.3. Common matrix splittings 

We now present the matrix splittings corresponding to some common stationary linear 
iterative solvers, with details for the case where A is symmetric, as holds for precision 
or covariance matrices. 

We have seen that the Gauss-Seidel iteration uses the splitting Mgs = L + D and 
Nqs = —L^. Gauss-Seidel is one of the simplest splittings and solvers, but is also quite 
slow. Other splittings have been developed, though the speed of each method is often 
problem specific. Some common splittings are shown in Table 1, listed with, roughly, 
greater speed downwards. Speed of convergence in a numerical example is presented 
later in Section 6. 

The method of successive over-relaxation (SOR) uses the splitting 

Msor = — D -I- L and Nsor =-D — iF (7) 

W U) 
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in which a; is a relaxation parameter chosen with 0 < w < 2. SOR with w = 1 is 
Gauss-Seidel. For optimal values of w such that p(MgQj^NsoR) < e(MQgNGs), SOR 
is an accelerated Gauss-Seidel iteration. Unfortunately, there is no closed form for the 
optimal value of lo for an arbitrary matrix A, and the interval of values of lo which admits 
accelerated convergence can be quite narrow (Young, 1971; Golub and Loan, 1989; Saad, 
2003). 

The symmetric-SOR method (SSOR) incorporates both a forward and backward sweep 
of SOR so that if A is symmetric then the splitting is symmetric (Golub and Loan, 1989; 
Saad, 2003), 

Mssor = -MsoRD^^Mgop, and Nssor = -NgQj^D~^NsoR- (8) 

We will make use of symmetric splittings in conjunction with polynomial acceleration in 
Section 5. 

When the matrix A is dense, Gauss-Seidel and SOR cost about 3n^ flops per iteration, 
with 2n^ due to multiplication by the matrix A (in order to calculate the residual) and 
another for the forward substitution to solve Mu = r. Richardson incurs no cost to 
solve Mu = r, while a solve with the diagonal Jacobi matrix incurs n flops. Iterative 
methods are particularly attractive when the matrix A is sparse, since then the cost per 
iteration is only 0{n) flops. 

Many theorems establish convergence of splittings by utilizing properties of A in 
specific applications. Some general conditions that guarantee convergence when A is 
SPD are given in the right column of Table 1 (Golub and Loan, 1989; Saad, 2003; Young, 
1971). 

4. Equivalence of stationary linear solvers and Gibbs 
samplers 

We first consider the equivalence between linear solvers and stochastic iterations in the 
case where the starting state and noise are not necessarily normally distributed, then in 
Section 4.2 et seq. we restrict consideration to normal distributions. 


4.1. General noise 

The striking similarity between the Gibbs sampler (1) and the Gauss-Seidel iteration (4) 
is no coincidence. It is an example of a general equivalence between the stationary linear 
solver derived from a splitting and the associated stochastic iteration used as a sampler. 
We will make explicit the equivalence in the following theorems and corollary. In the 
first theorem we show that a splitting is convergent (in the sense of stationary iterative 
solvers) if and only if the associated stochastic iteration is convergent in distribution. 
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Theorem 1 Let A = M — N be a splitting with M invertible, and let tt (•) be some fixed 
probability distribution with zero mean and fixed non-zero covariance. For any fixed vector 
b, and random vectors tt, = 0,1,2,. . the stationary linear iteration 

x(fe+i) = (9) 

converges, with A^^b as k ^ oo whatever the initial vector if and only if 

there exists a distribution 11 such that the stochastic iteration 

y(fc+i) = (10) 

converges in distribution to H, with ^ 11 as fc —>■ c» whatever the initial state y^*^^. 
Proof. If the linear iteration (9) converges, then p(M^^N) < 1 (Thm 3-5.1 in Young, 
1971). Hence there exists a unique distribution H with y(^+i) ^ H (Theorem 2.3.18-4 of 
Duflo, 1997), which shows necessity. Conversely, if the linear solver does not converge to 
a limit independent of x*^°^ for some b, that also holds for b = 0 and hence initializing the 
sampler with E [y^°^] = x^*’^ yields E [y^^"*"^^] = (M“^N)^x(°^ which does not converge 
to a value independent of . Sufficiency holds by the contrapositive. □ 

Convergence of the stochastic iteration (10) could also be established via the more 
general theory of Diaconis and Ereedman (1999) that allows the iteration operator G = 
M~^N to be random, with convergence in distribution guaranteed when G is contracting 
on average', see Diaconis and Freedman (1999) for details. 

The following theorem shows how to design the noise distribution tt so that the limit 
distribution H has a desired mean p. and covariance S = A~^. 

Theorem 2 Let A be SPD, A ~ M — N be a convergent splitting, fi a fixed vector, 
and TT (•) a fixed probability distribution with finite mean v and non-zero covariance V. 
Consider the stochastic iteration (10) where tt, k = 0,1,2,.... Then, whatever 

the starting state the following are equivalent: 

1. E = u and Var = V = -b N 

2. the iterates y^^'^ converge in distribution to some distribution H that has mean {i = 
A~^i/ and covariance matrix A~^; in particular E [y^^'^] —>■ fx and Var (y^^^) —>■ 
A“^ as k ^ oo. 

Proof. Appendix A.l. □ 

Additionally, the mean and covariance converge geometrically, with convergence fac¬ 
tors given by the convergence factors for the linear iterative method, as established in 
the following corollary. 

Corollary 3 The first and second moments of iterates in the stochastic iteration in 
Theorem 2 converge geometrically. Specifically, E(y(^)) —^ fx with convergence factor 
£)(M~^N) and Var(y*^^)) = A~^ — G^ — Var(y*^°))) (G^')^ —^ A~^ with conver¬ 

gence factor £i(M“^N)^. 
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Table 2. Some generalized Gibbs samplers for drawing from N (O, A adapted from common 
stationary linear solvers. Each Gibbs iteration requires sampling the noise vector ~ N (O, + N) 


Sampler 

M 

Var (c('“)) = -b N 

Richardson 

Jacobi 

Gibbs (Gauss-Seidel) 

SOR 

SSOR (REGS) 

-!-i 

IV 

D 

D -b L 

1 

-D-l-L 

UJ 

MsorD MgoR 

Z — C/J 

-I-A 

U! 

2D - A 

D 

2 — tj 

D 

CJ 

^ (MsorD-ImI’or + NI’orD-INsor) 


Proof. Appendix A.l. □ 

Note that the matrix splitting has allowed an explicit construction of the noise covari¬ 
ance to give a desired precision matrix of the target distribution. We see from Theorem 2 
that the stochastic iteration may be designed to converge to a distribution with non-zero 
target mean, essentially by adding the deterministic iteration (9) to the stochastic iter¬ 
ation (10). This is particularly useful when the mean is defined implicitly via solving a 
matrix equation. In cases where the mean is known explicitly, the mean may be added af¬ 
ter convergence of the stochastic iteration with zero mean, giving an algorithm with faster 
convergence since the covariance matrix converges with factor p(M“^N) 2 < p(M~iN) 
(this was also noted by Barone et ah, 2002). Convergence in variance for non-normal 
targets was considered in Fox and Parker (2014). 

Using Theorems 1 and 2, and Corollary 3 we can draw on the vast literature in numer¬ 
ical linear algebra on stationary linear iterative methods to find random iterations that 
are computationally efficient and provably convergent in distribution with desired mean 
and covariance. In particular, results in Amit and Grenander (1991); Barone and Frigessi 
(1990); Galli and Gao (2001), Roberts and Sahu (1997), and Liu et al. (1995) are special 
cases of the general theory of matrix splittings presented here. 

4.2. Sampling from normal distributions using matrix splittings 

We now restrict attention to the case of normal target distributions. 

Corollary 4 If in Theorem 2 we set tt = N(iz, V), for some non-zero covariance matrix 
V, then, whatever the starting state the following are equivalent: (i) V = M^-|-N; 
(ii) ^ N(/x, A~^) where /x = 

Proof. Since tt is normal, then II in Theorem 2 is normal. Since a normal distribution 
is sufficiently described by its first two moments, the corollary follows. □ 

Using Corollary 4, we found normal sampling algorithms corresponding to some com¬ 
mon stationary linear solvers. The results are given in Table 2. A sampler corresponding 
to a convergent splitting is implemented in Algorithm 5. 
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Algorithm 5: Stationary sampler of N(0, A 
input : SPD precision matrix A, M and N defining a convergent splitting of A, 
number of steps /cmax, initial state 
output: approximately distributed as N(0, A^^) 

for k = 0,..., /cinax do 

sample ~ N(0, + N); 

y(fe+i) = 

end 


The assignment y(^+^) = M~^(Ny(^) + in Algorithm 5 can be replaced by the 
slightly more expensive steps r^^^ = — Ay^^^ and y(^+^) = y(^) + which 

allows monitoring of the residual, and emphasizes the equivalence with the stationary 
linear solver in Algorithm 4. Even though convergence may not be diagnosed by a de¬ 
creasing norm of the residual, lack of convergence can be diagnosed when the residual 
diverges in magnitude. In practice, the effective convergence factor for a sampler may 
be calculated by solving the linear system (2) (perhaps with a random right hand side) 
using the iterative solver derived from the splitting and monitoring the decrease in error 
to evaluate the asymptotic average convergence factor using equation (6). By Corollary 3, 
this estimates the convergence factor for the sampler. 

The practicality of a sampler derived from a convergent splitting depends on the ease 
with which one can solve My = r for any r (as for the stationary linear solver) and 
also the ease of drawing iid noise vectors from N(0, + N). Sampling the noise vector 

is simple when a matrix square root, such as the Cholesky factorization, of -|- N is 
cheaply available. Thus, a sampler is at least as expensive as the corresponding linear 
solver since, in addition to operations in each iteration, the sampler must factor the 
n X n matrix Var(c^'=)) =M^ + N. For the samplers listed in Table 2 it is interesting 
that the simpler the splitting, the more complicated is the variance of the noise. Neither 
Richardson nor Jacobi splittings give useful sampling algorithms since the difficulty of 
sampling the noise vector is no less than the original task of sampling from N(0, A“^). 
The Gauss-Seidel splitting, giving the usual Gibbs sampler, is at a kind of sweet spot, 
where solving equations in M is simple while the required noise variance is diagonal, so 
posing a simple sampling problem. 

The SOR stationary sampler uses the SOR splitting Msor and Nsor in (7) for 
0 < w < 2 and the noise vector ~ N(0,MgQ^ -|- Nsor = ^^D) (Table 2). This 
sampler was introduced by Adler (1981), rediscovered by Barone and Frigessi (1990), 
and has been studied extensively (Barone et ah, 2002; Galli and Gao, 2001; Liu et ah, 
1995; Neal, 1998; Roberts and Sahu, 1997). For w = 1, the SOR sampler is a Gibbs 
(Gauss-Seidel) sampler. For values of oj such that p(MgQj^NsoR) < ^?(Mq 5 Ngs)), the 
SOR-sampler is an accelerated Gibbs sampler. As for the linear solver, implementation 
of the Gibbs and SOR samplers by Algorithm 5 requires multiplication by the upper 
triangular N and forward substitution with respect to M at a cost of 2n^ flops. In 
addition, these samplers must take the square root of the diagonal matrix at a 
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mere cost of 0{n) flops. 

Implementation of an SSOR sampler instead of a Gibbs or SOR sampler is advan¬ 
tageous since the Markov chain is reversible (Roberts and Sahu, 1997). SSOR 

sampling uses the symmetric-SOR splitting Mssor and Nssor in (8). The SSOR sta¬ 
tionary sampler is most easily implemented by forward and backward SOR sampling 
sweeps as in Algorithm 6, so the matrices Mssor and Nssor need never be explicitly 
formed. 

Algorithm 6: SSOR sampling from N(0, A“^) 

input : The SOR splitting M, N of A, relaxation parameter w, initial state y, 
h 

'^max 

output: y approximately distributed as N(0, A~^) 

set 7 = (^-1) A 

for k = l,..., feinax do 
sample z ~ N(0,1); 

X := M~^(Ny -|- yD^/^z); 
sample z ~ N(0,1); 
y := M“^(N’^x -|- yD^/^z) 

end 


We first encountered restricted versions of Corollary 4 for normal distributions in 
Amit and Grenander (1991) and in Barone and Frigessi (1990) where geometric conver¬ 
gence of the covariance matrices was established for the Gauss-Seidel and SOR split¬ 
tings. These and the SSOR splitting were investigated in Roberts and Sahu (1997) (who 
labelled the sampler REGS). 

Corollary 4 and Table 1 show that the Gibbs, SOR and SSOR samplers converge for 
any SPD precision matrix A. This summarizes results in Barone and Frigessi (1990); 
Galli and Gao (2001) and the deterministic sweeps investigated in Amit and Grenander 
(1991); Roberts and Sahu (1997); Liu et al. (1995). Corollary 4 generalizes these results 
for any matrix splitting A = M — N by guaranteeing convergence of the random iterates 
(10) to N(0, A^^) with convergence factor p(M“^N) (or p(M“^N)^ if /r = 0). 


5. Non-stationary iterative methods 

5.1. Acceleration of linear solvers by polynomials 

A common scheme in numerical linear algebra for accelerating a stationary method when 
M and N are symmetric is through the use of polynomial preconditioners (Axelsson, 
1996; Golub and Loan, 1989; Saad, 2003). Equation (5) shows that after k steps the 
error in the stationary method is a order polynomial of the matrix I — G = M~^A. 
The idea behind polynomial acceleration is to implicitly implement a different k*^ order 
polynomial P/c(M~^A) such that p(Pfc(M~^A)) < p((l — M~^A)^). The coefficients of 
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Pfc(M ^A) are functions of a set of acceleration parameters {{afc}, {Tfe}}, introduced by 
the second order iteration 

+ QffcTfeM~^(b — Ax^^^). (11) 

At the first step, ao = 1 and = x(°^ + ToM~^(b — Ax^°^). Setting ak = Tk = I 
for every k yields a basic un-accelerated stationary method. The accelerated iteration 
in (11) is implemented at a negligible increase in cost of 0(n) flops per iteration (due 
to scalar-vector multiplication and vector addition) over the corresponding stationary 
solver (3). 

It can be shown that (e.g., Axelsson (1996)) that the {k + 1)®* order polynomial Pk+i 
generated recursively by the second order non-stationary linear solver (11) is 


Pk+i (A) = {ak - akTkX) Pk (A) -|- (1 - ak) Pk-i (A). (12) 

This polynomial acts on the error = x^^'^ — A“^b by = Pk{'M.~^A)e^^\ which 

can be compared directly to (5). 

When estimates of the extreme eigenvalues Amin and Amax ofl — G = M“^A are 
available (Amin and Amax are real when M and N are symmetric), then the coefficients 
{Tk,ak} can be chosen to generate the scaled Chebyshev polynomials {Qk}, which give 
optimal error reduction at every step. The Chebyshev acceleration parameters are 


Tk 


2 

T Amin 


Pk 



(13) 


where ao = 1 and /3o = tq (Axelsson, 1996). Note that these parameters are independent 
of the iterates {x^^^}. Since M is required to be symmetric, applying Chebyshev accel¬ 
eration to SSOR is a common pairing; its effectiveness as a linear solver is shown later 
in Table 3. 

Whereas the stationary methods converge with asymptotic average convergence factor 
p(M“^N), the convergence factor for the Chebyshev method depends on cond(M”^A) = 
Amax/Amin- Specifically the scaled Chebyshev polynomial Qfe(A) minimizes niaxAg[A„i„,A„ax] Pk{X) 
over all k*^ order polynomials Pk, with 


max |Qfc(A)| 

[•^min 5-^max] 


2cr'= 

TT^’ 


(14) 


Since the error at the step of a Chebyshev accelerated linear solver is = 

Qk(X)e^^\ then the asymptotic convergence factor is bounded above by 


cr = 


1 ~ \/Amin/At] 


1 + a/ Amin/Ar 


(15) 


(p.181 Axelsson, 1996). Since cr € [0,1), the polynomial accelerated scheme is guaranteed 
to converge even if the original splitting was not convergent. Further, the convergence 
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factor of the stationary iterative solver is bounded below by p = (see e.g. 

Axelsson, 1996, Thm 5.9). Since a < p (except when Amin = Amax in which case cr = 
0), polynomial acceleration always reduces the convergence factor, so justifies the term 
acceleration. The Chebyshev accelerated iteration (11) is amenable to preconditioning 
that reduces the condition number, and hence reduces cr, such as incomplete Cholesky 
factorization or graphical methods (Axelsson, 1996; Saad, 2003). Axelsson also shows 
that after 


k* 


p ln(£/2) ^ 

In a 


(16) 


iterations of the Chebyshev solver, the error reduction is ||ei^ ^||A‘'/||e^°i||A‘' < £ for 
some real number v and any 0 < e < 1 (Axelsson, 1996, eqn 5.32). 


5.2. Acceleration of Gibbs sampling by polynomials 

Any acceleration scheme devised for a stationary linear solver is a candidate for acceler¬ 
ating convergence of a Gibbs sampler. For example, consider the second order stochastic 
iteration 


y(^-+i) = (1 - + afcrfcM-i(c('=) - Ay«) (17) 

analogous to the linear solver in (11) but now the vector b has been replaced by a 
random vector . The equivalence between polynomial accelerated linear solvers and 
polynomial accelerated samplers is made clear in the next three theorems. 

Theorem 5 Let A be SPD and A = M — N be a symmetric splitting. Consider a set of 
independent noise vectors with moments 

E(c('=)) = u and Var(c('=)) = OfcM -|- 

such that Ok := + {bk - 1) , bk ■= + 1, Kk+i := -h 

(1 — ak)Kk, cind Ki = tq. If the polynomial accelerated linear solver (11) converges to 
A“^b with a set of parameters {{ofc}, {'Tfc}} that are independent then the 

polynomial accelerated stochastic iteration (17) converges in distribution to a distribution 
n with mean A^^iz and covariance matrix A^^. Furthermore, if the are normal, 

then 

y^'^) 4 N(/x = A'^i/,A^^). 

Proof. Appendix A.2. □ 

Given a second order linear solver (11) that converges. Theorem 5 makes clear how to 
construct a second order sampler (17) that is guaranteed to converge. The next Corollary 
shows that the polynomial Pk that acts on the linear solver error — A^^b is the 
same polynomial that acts on the errors in the first and second moments of the sampler, 
E(yW)-A-l 1 / and Var(y(^^) ~ A ^ respectively. In other words, the convergence factors 
for a polynomial accelerated solver and sampler are the same. 
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Corollary 6 Suppose that the polynomial accelerated linear solver (11) converges with 
asymptotic convergence factor a = (limfc_>oo max>, where is the order 

polynomial recursively generated by (12). Then under the conditions of Theorem 5, 


E = Pfe(M-iA) (E(y(°)) - ^ A’V 

with asymptotic convergence factor a, and 

Var = A^^ - Pfc(M-^A) (^A’^ - Var(y(°^)) (Pfc(M-^A))'^ ^ A’^ 


with asymptotic convergence factor . 


Proof. Appendix A.2. □ 

Corollary 6 allows a direct comparison of the convergence factor for a polynomial 
accelerated sampler (a, or if /r = 0) to the convergence factor given previously for the 
corresponding un-accelerated stationary sampler (p(M“^N), or p(M^^N)^ if ^ = 0). In 
particular, given a second order linear solver with accelerated convergence compared to 
the corresponding stationary iteration, the corollary guarantees that the second order 
Gibbs sampler (17) will converge faster than the stationary Gibbs sampler (10). 

Just as Chebyshev polynomials are guaranteed to accelerate linear solvers, Gorollary 
6 assures that Ghebyshev polynomials can also accelerate a Gibbs sampler. Using Theo¬ 
rem 5, we derived the Ghebyshev accelerated SSOR sampler (Fox and Parker, 2014) by 
iteratively updating parameters via (13) and then generating a sampler via (17). Explicit 
implementation details of the Chebyshev accelerated sampler are provided in the supple¬ 
mentary materials. The polynomial accelerated sampler is implemented at a negligible 
increase in cost of 0(n) flops per iteration over the cost (4n^ flops) of the SSOR sampler 
(Algorithm 6). The asymptotic convergence factor is given by the next Corollary, which 
follows from Corollary 6 and equation (15). 


Corollary 7 If the Chebyshev accelerated linear solver converges, then the mean E(y(^")) 
of the corresponding Chebyshev accelerated stochastic iteration (17) converges to fi = 


A ^ 1 / with asymptotic convergence factor 


a /A n 
min /A n 


Var(y(^')) converges to A ^ with asymptotic convergence factor 


and the covariance matrix 

1 —\/An 


a /A„ 

min /A n 


Corollary 7 and (14) show that a Chebyshev accelerated normal sampler is guaranteed 
to converge faster than any other acceleration scheme that has the parameters {{rfc, Ofe}} 
independent of the iterates {y^^'^}. This result also shows that the preconditioning ideas 
presented in section 5.1 to reduce cond(M~^A) = Amax/Amin can also be used to speed 
up Chebyshev accelerated samplers. We do not investigate such preconditioning here. 

Corollary 7 and equation (16) suggest that, for any e > 0, after k* iterations the 
Chebyshev error reduction for the mean is smaller than e. But even sooner, after k** = 
fc*/2 iterations, the Chebyshev error reduction for the variance is predicted to be smaller 
than e (Fox and Parker, 2014). 
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Figure 2. Left panel: Location of non-zero elements in the 100 x 100 precision matrix A. Right 
panel: Relative error in covariance || || 2 /||A ~^||2 versus number of floating point oper¬ 

ations (flops) for a sampler implemented with SSOR and uj = 1, SSOR with optimal relaxation 
u) = 1.6641, and SSOR with Chebyshev acceleration. Also shown is the relative error and flop 
count for a sample drawn using Cholesky factoring. 


6. Computed Examples 

The iterative sampling algorithms we have investigated are designed for problems where 
operating by the precision matrix is cheap. A common such case is when the precision 
matrix is sparse, as occurs when modeling a GMRF with a local neighbourhood structure. 
Then, typically, the precision matrix has 0(n) non-zero elements, so direct matrix-vector 
multiplication has 0{n) cost. We give two examples of sampling using sparse precision 
matrices: first, we present a small n = 100 example where complete diagnostics can be 
computed for evaluating the quality of convergence; and second, we present a n = 10® 
Bayesian linear inverse problem that demonstrates computational feasibility for large 
problems. The samplers are initialized with = 0 in both examples. 

6.1. A 10 X 10 lattice example (n = 100) 

A first order locallii linear sparse precision matrix A, considered by Higdon (2006); 
Rue and Held (2005), is 

{ n, if z = j 

— 1 if i^j and ||si —Sj|| 2 <l . 

0 otherwise 

The discrete points {si} are on a regular lOx 10 lattice (n = 100) over the two dimensional 
domain S = [1,10] x [1,10]. Thus A is 100 x 100, ||A ||2 = 7.8 and jjS = A ^^||2 = 10^. 
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Table 3. The number of iterations and the total number of floating point operations performed by 
some common stationary and accelerated linear solvers, and the Cholesky factorization, used to solve 
Ax = b for fixed non-zero b. Each solver was run until the residual became sufficiently small, 

||b ~ Axt ^)||2 < 10“®. Details in section 6. 


solver 

U) 

e(M-iN) 

number of iterations 

flops 

Richardson 

1 

6.8 

DNG 

- 

Jacobi 

- 

.999972 

4.01 X 10® 

5.69 X 10'^ 

Gauss-Seidel 

- 

.999944 

2.44 X 10® 

4.34 X 10® 

SSOR 

1.6641 

.999724 

6.7 X 10^ 

2.39 X 10® 

SOR 

1.9852 

.985210 

1655 

2.95 X 10® 

Cheby-SSOR 

1 

.9786 

958 

3.41 X 10® 

Cheby-SSOR 

1.6641 

.9673 

622 

2.21 X 10® 

CG 

- 

.6375 

48 

9.22 X lO'^ 

CG-SSOR 

1.6641 

.4471 

29 

6.66 X lO'* 

Cholesky 

- 

- 

- 

1.35 X lO'* 


The sparsity of A is shown in the left panel of Figure 2. The scalar rii is the number 
of points neighbouring Si, i.e., with distance 1 from Si. Although v? = 10^, the number 
of non-zero elements of A is 0{n) (460 in this example). Since the bandwidth of A is 
a Cholesky factorization costs 0{'n?) flops (Rue, 2001) and each iteration of an 
iterative method costs 0{n) flops. 

To provide a comparison between linear solvers and samplers, we solved the system 
Ax = b using linear solvers with different matrix splittings (Table 1), where b is fixed 
and non-zero, all initialized with = 0. The results are given in Table 3. The Richard¬ 
son method does not converge (DNC) since the spectral radius of the iteration operator 
is greater than 1. The SOR iteration was run at the optimal relaxation parameter value 
of a; = 1.9852. SSOR was run at its optimal value of w = 1.6641. Chebyshev accelerated 
SSOR (Cheby-SSOR), CG accelerated SSOR (CG-SSOR) (both run with w = 1.6641) 
and GG utilize a different implicit operator for each iteration, and so the spectral radius 
given in these cases is the geometric mean spectral radius of these operators (estimated 
using (5)). Even for this small example, Chebyshev acceleration reduces the compu¬ 
tational effort required for convergence by about two orders of magnitude, while CG 
acceleration reduces work by nearly two more orders of magnitude. 

We investigated the following Gibbs samplers: SOR, SSOR, and the Chebyshev accel¬ 
erated SSOR. These samplers are guaranteed to converge since the corresponding solver 
converges (Theorem 1). Since the convergence factor for a sampler is equal to the con¬ 
vergence factor for the corresponding solver (Corollaries 3 and 6) then Gibbs samplers 
implemented with any of the matrix splittings in Table 1 exhibit the same convergence 
behavior as shown for the linear solvers in Table 3. Convergence of the sample covariance 
Ri Var(y*'^^) —>■ A^^, calculated using 10^ samples, is shown in the right panel of 
Figure 2 that displays the relative error ||A~^ — Sy^^|| 2 /||A “^||2 as a function of the flop 
count. Each sampler iteration costs about 2.24 x 10^ flops. This performance is compared 
to samples constructed by a Cholesky factorization, which cost 1.34 x 10"^ flops (depicted 
as the green vertical line in the right panel of Figure 2). Since the sample means were 
uniformly close to zero, error in the mean is not shown. 


imsart-bj ver. 2011/11/15 file: EGNMSPA_2015-03-25_arXiv.tex date: May 14, 2015 










22 


Fox and Parker 


The benchmark for evaluation of the of the convergence of the iterative samplers in 
finite precision is the Cholesky factorization, its relative error is depicted as the green 
horizontal line in the right panel of Figure 2. For this example, the iterative samplers 
produce better samples than a Cholesky sampler since the iterative sample covariances 
become more precise with more computing time. 

The geometric convergence in distribution of the un-accelerated SSOR samples to 
N(0, A~^) is clear in Figure 2, and even after 5 x 10® flops, convergence in distribution has 
not been attained. This is not surprising based on the large number of iterations (0(n^)) 
necessary for the same stationary method to converge to a solution of Ax = b (see 
Table 3). The accelerated convergence of the Chebyshev polynomial samplers, suggested 
by the fast convergence of the corresponding linear solvers depicted in Table 3, is also 
evident in Figure 2, with convergence after 1.70 x 10® flops (76 iterations) for the Cheby- 
SSOR sampler with optimal relaxation parameter uj = 1.6641, and the somewhat slower 
convergence at 2.37 x 10® flops (106 iterations) when w = 1. 

6.2. A 100 X 100 X 100 (n = 10®) linear inverse problem in 
biofilm imaging 

We now perform accelerated sampling from a GMRF in 3-dimensions, as a stylized 
example of estimating a voxel image of a biofilm from confocal scanning laser micro¬ 
scope (CSLM) data (Lewandowski and Beyenal, 2014). This large example illustrates 
the feasibility of Chebyshev accelerated sampling in large problems for which sampling 
by Cholesky factorization of the precision matrix is too computationally and memory 
intensive to be performed on a standard desktop computer. 

We consider the problem of reconstructing a 100 x 100 x 100 voxel image x of a bacterial 
bioHlm, i.e., a community of bacteria aggregated together as slime, given a subsampled 
100 X 100 X 10 CSLM data set y. For this exercise, we synthesized a ‘true’ image xt of a 
90^m tall ellipsoidal column of biohlm attached to a surface, taking value 10 inside the 
biofilm column, and 0 outside, in arbitrary units. Similar geometry has been observed 
experimentally for Pseudomonas aeruginosa biofilms (Swogger and Pitts, 2005), and is 
also predicted by mathematical models of biohlm growth (Alpkvist and Klapper, 2007). 
CSLM captures a set of planar ‘images’ at different distances from the bottom of the 
biohlm where it is attached to a surface. In nature biohlms attach to any surface over 
which water hows, e.g., human teeth and creek bottoms. Each horizontal planar image 
in this example is 100 x 100 pixels; the distance between pixels in each plane is typically 
about 1 ^m, with the exact spatial resolution set by the microscope user. The vertical 
distance between planar slices in a CSLM image is typically an order of magnitude 
larger than the horizontal distance between pixels; for this example, the vertical distance 
between CSLM planes is lO^m. 

Civen the ‘true’ image Xt, we generated synthetic 100 x 100 x 10 CSLM data by 

y = Fxt -k e 
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Figure 3. The left panel depicts a 100 x 100 x 10 pixelated confocal scanning microscope image, 
y of a simulated ellipsoidal column of a bacterial biofilm; the distance between horizontal pixels 
is l^m, the distance between vertical pixels is 10/rm. The right panel shows a surface rendering 
of a sample from the n = 10® dimensional multivariate normal posterior distribution conditioned 
on hyperparameters. 


where the 10® x 10® matrix F arithmetically averages over 10 pixels in the vertical dimen¬ 
sion of X, to approximate the point spread function (PSF) of CSLM (Sheppard and Shotton, 
1997), and e ~ N(0, = I). The data is displayed in the left panel of Figure 3 as layers 

of pixels, or ‘slices’, located at the centre of sensitivity of the CSLM, i.e. the centre of 
the PSF. Thus, the likelihood we consider is 7r(y|x) = N(Fx,P“^). 

To encapsulate prior knowledge that the bacteria in the biofilm aggregate together 
we model x by the GMRF x ^ A^(0, Q)j®) where the precision matrix Q/j models local 
smoothness of the density of the biofilm and background. We construct the matrix Q/j 
as a sparse inverse of the dense covariance matrix corresponding to the exponential 
covariance function. This construction uses the relationship between stationary Gaussian 
random fields and partial differential equations (PDFs) that was noted by Whittle (1954) 
for the Matern (or Whittle-Mat&n (Guttorp and Gneiting, 2005)) class of covariance 
functions, that was also exploited by Gui et al. (2011) and Lindgren et al. (2011). Rather 
than stating the PDF, we find it more convenient to work with the equivalent variational 
form, in this case (the square of) 


where x is a continuous stochastic field, dv is the volume element in the domain T? and 

ds is the surface element on the boundary dT>. This form has Fuler-Lagrange equations 

being the Helmholtz operator with (local) Robin boundary conditions x + = 0 on 

2 

dT>, induced by the ^ term. In our example we apply the Hessian of this form twice, 
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X 



0 0 

Figure 4. Contours of the effective covariance function centred on the cubic domain, logarith¬ 
mically spaced in value. 


which can be thought of as squaring the Helmholtz operator. When the quadratic form 
is written in the operator form Q (x) = Hx, where H is the Hessian, the resulting 
Gaussian random field has density 

TT (x) oc exp {—. (18) 

We chose this operator because the discretized precision matrix is sparse, while the 
covariance function (after scaling) is close to exp{—r/i?}, having length-scale R. 

The GMRF over the discrete field x is then defined using FEM (finite element method) 
discretization; we used cubic-elements between nodes at voxel centres in the cubic do¬ 
main, and tri-linear interpolation from nodal values within each element. To verify this 
construction we show in Figure 4 contours of the resulting covariance function, between 
the pixel at the centre of the normalised cubic domain [0,1]^ and all other pixels, for 
length scale R = 1/4. The contours are logarithmically spaced in value, hence the evenly 
spaced spherical contours show that the covariance indeed has exponential dependence 
with distance. The contours look correct at the boundaries, indicating that the local 
Robin boundary conditions^ give the desired covariance function throughout the do¬ 
main. In contrast, Dirichlet conditions would make the cubic boundary a contour, while 
Neumann conditions as used by Lindgren et al. (2011) would make contours perpen¬ 
dicular to the cubic boundary; neither of those pure boundary conditions produce the 
desired covariance function. 

In the deterministic setting, this image recovery problem is an example of a linear 
inverse problem. In the Bayesian setting, we may write the hierarchical model in the 

^Local boundary conditions are approximate but preserve sparseness. The exact boundary conditions 
are given by the boundary integral equation for the exterior Helmholtz operator, resulting in a dense 
block in H that is inconvenient for computation (Neumayer, 2011). 
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general form 

y|x,0 ~ N(Fx,P^1) (19) 

x|0 ^ N(/x,Q-i) (20) 

e ~ TT{e) ( 21 ) 

where 0 is a vector of hyperparameters. This stochastic model occurs in many set¬ 
tings (see, e.g., Simpson et al., 2012; Rue and Held, 2005) with y being observed data, 
X is a latent field, and 0 is a vector of hyperparameters that parameterize the precision 
matrices P and Q. The (hyper)prior tt [6) models uncertainty in covariance of the two 
random Helds. 

There are several options for performing sample-based inference on the model (19), 
(20), (21). Most direct is forming the posterior distribution 7r(x, 0|y) via Bayes’ rule 
and implementing Markov chain Monte Carlo (MCMC) sampling, typically employing 
Metropolis-Hastings dynamics with a random walk proposal on x and 6. Such an algo¬ 
rithm can be very slow due to high correlations within the latent field x, and between 
the latent field and hyperparameters 9. More efficient algorithms block the latent field, 
noting that the distribution over x given everything else is a multivariate normal, and 
hence can be sampled efficiently as we have discussed in this paper. Higdon (2006) and 
Bardsley (2012) utilized this structure, along with conjugate hyperpriors on the com¬ 
ponents of 9, to demonstrate a Gibbs sampler that cycled through sampling from the 
conditional distributions for x and components of 9. When the normalizing constant for 
TT (x, 0|y) is available, up to a multiplicative constant independent of state, a more effi¬ 
cient algorithm is the one block algorithm (Rue and Held, 2005, section 4.1.2) in which 
a candidate 9' is drawn from a random walk proposal, then a draw x' ^ 7r(x'|y, 9'), with 
the joint proposal (0^,x') accepted with the standard Metropolis-Hastings probability. 
The resulting transition kernel in 9 is in detailed balance with the distribution over 9\y, 
and hence can improve efficiency dramatically. A further improvement can be to perform 
MCMC directly on -k (9\y) as indicated by Simpson et al. (2012), with subsequent in¬ 
dependent sampling x ~ tt (x|y, 9) to facilitate Monte Carlo evaluation of statistics. In 
each of these schemes, computational cost is dominated by the cost of drawing samples 
from the large multivariate normal x ^ tt (x|y, 0). We now demonstrate that sampling 
step for this synthetic example. 

In our example, the distribution over the 100 x 100 x 100 image x, conditioned on 
everything else, is the multivariate normal 

7r(x|y,0 = R) = N(x;/x = A-iF^Py,S=A-i) (22) 

with precision matrix A = F^PF -|- (cf. Calvetti and Somersalo (2007); Higdon 
(2006)). For this calculation we used the same covariance matrix as shown above, so 
R = 1/4 in units of the width of the domain, though for sample-based inference one 
would use samples from the distribution over i?|x, y. The right panel of Figure 3 depicts 
a reconstructed surface derived from a sample from the conditional distribution in (22) 
using the Chebyshev polynomial accelerated SSOR sampler. The sampler was initialized 
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with the precision matrix A, E(c(^^) = F^Py for all /c, and relaxation parameter w = 1. 
The contour is at value 6, after smoothing over 3x3x3 voxels, displaying a sample 
surface that separates regions for which the average over 3x3x3 voxel blocks is less 
than 6 (outside surface) and greater than 6 (inside surface). As can be seen, the surface 
makes an informative reconstruction of the ellipsoidal phantom. 

Using CG, estimates of the extreme eigenvalue of Mgg^Qj^A were Amin = 4.38 x 10“® 
and Amax = 1 — 1-36 x 10“®. By Corollary 6, the asymptotic convergence factors for 
the Chebyshev sampler are a k, 0.9958 for the mean and cr^ = .9917 for the covariance 
matrix. Using this information, equation (16) predicts the number of sampler iterations 
until convergence. After k* = 4566 iterations of the Chebyshev accelerated sampler, it is 
predicted that the mean error is reduced by e = 10“®; that is 

||/x-E(y(^-*))||2«10-®||/x-E(y(°))||2. 

But even sooner, after only /c** = A:*/2 = 2283 iterations, it is predicted that the covari¬ 
ance error is 

||A“i - Var(y('=**))|| « 10“®||A“^ - Var(y(°))||. 

Contrast these Chebyshev polynomial convergence results to the performance of the 
non-accelerated stationary SSOR sampler that has convergence factors p(M“^N) Ri 1 — 
Amin = 1 —4.38x 10“® for the mean error, and p(M“^N)^ = 1 —8.76x 10“® for covariance 
error. These convergence factors suggest that after running the non-accelerated SSOR 
Gibbs sampler for only 4566 iterations, the covariance error will be reduced to only 
R! 0.96 of the original error; 1.9 x 10® iterations are required for a 10“® 

reduction. 

The cost difference between the Cholesky factorization and an iterative sampler in this 
example is dramatic. After finding a machine with the necessary memory requirements, 
the Cholesky factorization would cost about b^n = 10^® flops (since the bandwidth of 
the precision matrix A is about b = 10®). Since the number of non-zero elements of A is 
3.3 X 10®, an iterative sampler costs about 6.6 x 10® flops per iteration, much less than 
n^. The sample in Figure 3 was generated by /cmax = 5 x 10® iterations of the Chebyshev 
accelerated SSOR sampler, at a total cost of 3.3 x 10®^ flops, which is about 10"*^ times 
faster than Cholesky factoring. 


7. Discussion 

This work began, in part, with a curiosity about the convergence of the sequence of 
covariance matrices in Gibbs sampling applied to multivariate normal distributions, as 
studied by Liu et al. (1995). Convergence of that sequence indicates that the algorithm is 
implicitly implementing some factorization of the target covariance or precision matrix. 
Which one? 

The answer was given by Goodman and Sokal (1989), Amit and Grenander (1991), 
Barone and Frigessi (1990), and Galli and Gao (2001), that the standard component- 
sweep Gibbs sampler corresponds to the classical Gauss-Seidel iterative method. That 
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result is given in section 4.2, generalized to arbitrary matrix splittings, showing that 
any matrix splitting used to generate a deterministic relaxation also induces a stochas¬ 
tic relaxation that is a generalized Gibbs sampler; the linear iterative relaxation and 
the stochastic relaxation share exactly the same iteration operator, conditions for con¬ 
vergence, and convergence factor, which may be summarized by noting that they share 
exactly the same error polynomial. 

Equivalence of error polynomials is important because they are the central object 
in designing accelerated solvers including the multigrid, Krylov space, and parallel al¬ 
gorithms. We demonstrated that equivalence explicitly for polynomial acceleration, the 
basic non-stationary acceleration scheme for linear solvers, showing that this control of 
the error polynomial can be applied to Gibbs sampling from normal distributions. It 
follows that, just as for linear solvers, Chebyshev-polynomial accelerated samplers have 
a smaller average asymptotic convergence factor than their un-accelerated stationary 
counterparts. 

The equivalences noted above are strictly limited to the case of normal target dis¬ 
tributions. We are also concerned with continuous non-normal target distributions and 
whether acceleration of the normal case can usefully inform acceleration of sampling 
from non-normal distributions. Gonvergence of the unaccelerated, stationary, iteration 
applied to bounded perturbations of a normal distribution was established by Amit 
(1991), though carrying over convergence rates proved more problematic. 

There are several possibilities for extending the acceleration techniques to non-normal 
distributions. A straightforward generalization is to apply Gibbs sampling to the non¬ 
normal target, assuming the required conditional distributions are easy to sample from, 
though using the directions determined by the accelerated algorithm. Simply applying 
the accelerated algorithm to the non-normal distribution does not lead to optimal accel¬ 
eration, as demonstrated by Goodman and Sokal (1989). 

A second route, that looks more promising to us, is to exploit the connection between 
Gibbs samplers and linear iterative methods that are often viewed as local solvers for 
non-linear problems, or equivalently, optimizers for local quadratic approximations to 
non-quadratic functions. Since a local quadratic approximation to logTr is a local Gaus¬ 
sian approximation to tt, the iterations developed here may be used to target this local 
approximation and hence provide local proposals in an MGMG. We imagine an algorithm 
along the lines of the trust-region methods from optimization in which the local quadratic 
(Gaussian) approximation is trusted up to some distance from the current state, imple¬ 
mented via a distance penalty. One or more steps of the iterative sampler would act as a 
proposal to a Metropolis-Hastings accept/reject step that ensures the correct target dis¬ 
tribution. Metropolis adjusted Langevin (MALA) and hybrid Monte Carlo (HMC) turn 
out to be examples of this scheme (Norton and Fox, 2014), as is the algorithm presented 
by Green and Han (1992). This naturally raises the question of whether acceleration of 
the local iteration can accelerate the Metropolised algorithm. This remains a topic for 
ongoing research. 
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Appendix A: Appendix 

A.l. Stationary sampler convergence (Proof of Theorem 2 and 
Corollary 3) 

First, the theorem and corollary are established for the mean. Since A = M —N is a con¬ 
vergent splitting, then (10) and Theorem 1 show that = v if and only if 

A~^iy with the same convergence factor as for the linear solver. To establish convergence 
of the variance, let G = M~^N in (10), then 

This equation and the independence of show that Var(y*'^^ |y(°)) = (G®M“^ Var(c®)M“^(G®)^) . 

Theorem 1 establishes the existence of a unique limiting distribution with a non-zero co- 
variance matrix P. Thus, for 11, (10) implies 

r = GPG^ -k M-1 Var(c(®) )M-'^ (23) 

since y*-®^ and c*^®) are independent. Thus Var(y^^^ |y*^°)) = P — G^P(G*^)^, and so 

Var(y^''^) = P - G'® (p - Var(y^°^)) (G^f. (24) 

That is, Var(y(^')) —P with convergence factor p(M“^N)^. To prove that part (b) of the 
theorem implies part (a), consider the starting vector y^*^^ ^ 11 with covariance matrix 
P = Since is independent of the relation (23) shows that Var(c*^^^) = 

M(A-i - GA^1 g'^)M^ = MA^^M^ - NA^^N^. Substituting in N = M - A shows 
that Var(c*^^')) = -|- N. To prove that (a) implies (b), consider y(°) ^ N(/.t, A~^). By 

(24), P-Var(y(i)) = G(P-A^i)G^. Substituting Var(c(°)) = M(A-i -GA"^G^)M^ 
into equation (23) shows P —A^^ = G(P —A^^)G^. Thus Var(y(^)) = A^^, which shows 
that Var(y*^^)) has converged to A^^. By Theorem 1, P = A~^. 

A.2. Polynomial accelerated sampler convergence (Proof of 
Theorem 5 and Corollary 6) 

If the polynomial accelerated linear solver (11) converges, then E(y^^~^^^) —>■ A~^i?(c^^^) = 

(Ji. To determine Var(c*^^)) rewrite the iteration (17) as yl^+^l = (1 — Q!fe)y*^^“^^ -|- 
-k ttfe (MW)”^cW where = ;AM, - A, and G^'®) = 

I - Tfc (MW)"^ A = (MW)"^NW. First, we will consider y(®-i) ~ N(/i, A~^) and 
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then find Var(c*^^)) that will guarantee that ~ N(/i, A ^). Since {c^*^} are 

independent of {y^®^}, the above equation for y(^+i) shows that Var(c(^)) is equal to 

((1 - (1 - akf)A-^ - 2(1 - 

where := Cov(y*^^“^\ To simplify this expression, we need Lemma 8, which 

gives explicitly. Parts (1) and (2) of the lemma show that 

Var(c('®)) = (a^(A-i - G^'®)+ 2(1 - ak)ak{A~^ - G^ 

Part (3) of Lemma 8 shows that Var(c*^^^) has the form specified in the theorem. 
Lemma 8 For a symmetric splitting A = M — N, 

1. kw is symmetric. 

2. = gL^^A'^, where G^f^ = I - k^M'^A and Kk+i '■= afcU- + (1 - ctk)Kk- 

3. A~^ - GrA^^Gl = rKM-i((l/T + l/K)M-A)M-h 

Proof. To nail down rewrite the Chebyshev iteration (17) as 

Y(fe+i) = f (1 -ak) I \ y^k) ^ f gW \ 


where ^ ^ ^Y^k+i) ^ ^ g(fe) ^ (m('=)) Letting 

= ('7^)1) shows Jt 

Var (y^'^+D) = c;('=)Var + al [J ) . 


(25) 

If Var (y(°)) = A”! then Var (y^'^)) = A'^ for fc > 1 in which case Var ig 

= s<‘>( 3,,ir n: )C<‘V+»)26) 


A-1 

j^(fc+l)T 


By definition of Y*^°\ = 0; for fc > 0, 

K('=+i) = OfcG^'®)A'l + (1 - ofe) kW^. (27) 

Since oq = 1, then = G^°^A~^ which proves parts (1) and (2) of the Lemma for 
fc = 0 since ki = tq and Glf^A“^ is symmetric. Assuming that = G^f^A”^ for 
fc > 0, the recursion in (27) gives ~ [c^kTk + (1 — afc) Kfe] M“^A) A“^ so the 

expansion and recursion hold for fc + 1, and parts (1) and (2) of the Lemma follow by 
induction. Part (c) of the Lemma follows from the equation 

A^i - G^A-^G^ = M-i {MrA~^Mj) - M-^N^A-^ . □ 
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The selection of Var(c^^^) = OfeM + 6fcN assures that if Var = A then 

Var for fc > 1. Thus, subtracting (26) from (25) gives 


Var 


|'y('=+i 


i))_ 


A-i 

j^(fc+l)T ^-1 


= e<‘)(v.r(Y(‘>)-( 


or = gik)gik)gik)T k > 0, where = Var(Y('=)) - ^ 


Hence. 


by recursion, £(’^'> = (nfj; £^°'> (nfjo ■ 


K(U 

A-i 


A-i 

k('=) \ 

K(Ur 

A-i ) 

that satisfies 


with iP(i) = g^°'> = (^ Q ^ • Thus 

+ (1 - at) =ak{l- r^M-^A) + (1 - a^) 


with = 0*-°^ which shows that by (12). Furthermore, this shows 

that the error in variance at the iteration has the specified form and convergence 
factor. 
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